CUT&Tag 数据处理与分析教程 四:Spike-in 对 CUT&Tag 数据的校正 对数据的校正
CUT&Tag 数据处理与分析教程 一(官方手把手教学)
CUT&Tag 数据处理与分析教程 二:质控(不需要修剪 reads!不需要修剪 reads!不需要修剪reads!)和数据比对
CUT&Tag 数据处理与分析教程 三:BAM 文件统计(CUT&Tag 不建议去重不去重不去重)
在此章节我们将介绍怎么使用 Spike-in 来对数据进行校正,当然在我们很多情况下不会加入 Spike-in。所以对于 Spike-in 按需求操作即可。
IV. 比对结果筛选和文件格式转换
通过回帖质量来筛选回贴上的 reads(可选)
有些项目可能需要对回帖质量分数进行更严格的过滤。博客 How does bowtie2 assign MAPQ scores?[1]
通过实例详细讨论了 bowtie 是如何计算质量分数的。
MAPQ(x) = -10 * log10(P(x is mapped wrongly)) = -10 * log10(p)
MAPQ 值的范围为 0 到 37, 40 或 42。samtools view -q minQualityScore
将过滤所有低于用户定义的 minQualityScore
的回帖结果。
##== linux 命令 ==##
minQualityScore=2
samtools view -q $minQualityScore ${projPath}/alignment/sam/${histName}_bowtie2.sam >${projPath}/alignment/sam/${histName}_bowtie2.qualityScore$minQualityScore.sam
如果你实现了这个过滤,在以下步骤中,通过筛选 sam 文件 ${histName}_bowtie2.qualityScore$minQualityScore.sam
来替换${histName}_bowtie2.sam
文件格式转换(可选)
本节是为 Peak calling 和 可视化做准备 所必需的,其中有一些过滤和文件格式转换需要完成。
##== linux 命令 ==##
## Filter and keep the mapped read pairs
## 筛选和保留比对上的双端 reads
samtools view -bS -F 0x04 $projPath/alignment/sam/${histName}_bowtie2.sam $projPath/alignment/bam/${histName}_bowtie2.mapped.bam
## Convert into bed file format
## 将 BAM 文件转换为 bed 文件格式
bedtools bamtobed -i $projPath/alignment/bam/${histName}_bowtie2.mapped.bam -bedpe $projPath/alignment/bed/${histName}_bowtie2.bed
## Keep the read pairs that are on the same chromosome and fragment length less than 1000bp.
## 保留那些在同一条染色体且片段长度小于 1000bp 的双端 reads
awk '$1==$4 && $6-$2 < 1000 {print $0}' $projPath/alignment/bed/${histName}_bowtie2.bed $projPath/alignment/bed/${histName}_bowtie2.clean.bed
## Only extract the fragment related columns
## 仅提取片段相关的列
cut -f 1,2,6 $projPath/alignment/bed/${histName}_bowtie2.clean.bed | sort -k1,1 -k2,2n -k3,3n >$projPath/alignment/bed/${histName}_bowtie2.fragments.bed
评估重复性
为了研究重复之间和不同条件下的可重复性,基因组被分成 500 bp bin,每个 bin reads 计数的 log2 转换值的 皮尔逊相关性在重复数据集之间计算。多个重复和 IgG 对照数据集显示在层次化聚类关联矩阵中。
#== linux 命令 ==##
## We use the mid point of each fragment to infer which 500bp bins does this fragment belong to.
binLen=500
awk -v w=$binLen '{print $1, int(($2 + $3)/(2*w))*w + w/2}' $projPath/alignment/bed/${histName}_bowtie2.fragments.bed |\
sort -k1,1V -k2,2n |\
uniq -c |\
awk -v OFS="\t" '{print $2, $3, $1}' |\
sort -k1,1V -k2,2n >$projPath/alignment/bed/${histName}_bowtie2.fragmentsCount.bin$binLen.bed
reprod = c()
fragCount = NULL
for(hist in sampleList){
if(is.null(fragCount)){
fragCount = read.table(paste0(projPath, "/alignment/bed/", hist, "_bowtie2.fragmentsCount.bin500.bed"), header = FALSE)
colnames(fragCount) = c("chrom", "bin", hist)
}else{
fragCountTmp = read.table(paste0(projPath, "/alignment/bed/", hist, "_bowtie2.fragmentsCount.bin500.bed"), header = FALSE)
colnames(fragCountTmp) = c("chrom", "bin", hist)
fragCount = full_join(fragCount, fragCountTmp, by = c("chrom", "bin"))
}
}
M = cor(fragCount %>% select(-c("chrom", "bin")) %>% log2(), use = "complete.obs")
corrplot(M, method = "color", outline = T,
addgrid.col = "darkgray", order="hclust",
addrect = 3, rect.col = "black", rect.lwd = 3,
cl.pos = "b", tl.col = "indianred4",
tl.cex = 1, cl.cex = 1, addCoef.col = "black",
number.digits = 2, number.cex = 1,
col = colorRampPalette(c("midnightblue","white","darkred"))(100)
)
V. Spike-in 校正
这部分是可选的,但建议取决于您的实验方案。我们在前面节展示了 spike-in 基因组的连接以及对 spike-in 序列的汇总 。潜在的假设是, 对于一系列使用相同数量的细胞的样本,回帖到原基因组和大肠杆菌基因组的片段比例是相同的。由于这一假设,我们不会在实验之间或纯化的 pATn5 批次之间进行 标准化处理,因为它们携带的大肠杆菌 DNA 数量可能非常不同。为了避免在归一化数据中出现小分数,我们将换算系数 Scaling factor:S 定义为:
S = C / (fragments mapped to E. coli genome)
然后计算归一化覆盖率:
Normalized coverage = (primary_genome_coverage) * S
这个常数是一个任意的乘数,通常是 10,000。结果文件将相对较小的基因组覆盖 bedgraph 文件。
##== linux 命令 ==##
if [[ "$seqDepth" -gt "1" ]]; then
mkdir -p $projPath/alignment/bedgraph
scale_factor=`echo "10000 / $seqDepth" | bc -l`
echo "Scaling factor for $histName is: $scale_factor!"
bedtools genomecov -bg -scale $scale_factor -i $projPath/alignment/bed/${histName}_bowtie2.fragments.bed -g $chromSize > $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph
fi
换算系数:Scaling factor
##=== R 命令 ===##
scaleFactor = c()
multiplier = 10000
for(hist in sampleList){
spikeDepth = read.table(paste0(projPath, "/alignment/sam/bowtie2_summary/", hist, "_bowtie2_spikeIn.seqDepth"), header = FALSE, fill = TRUE)$V1[1]
histInfo = strsplit(hist, "_")[[1]]
scaleFactor = data.frame(scaleFactor = multiplier/spikeDepth, Histone = histInfo[1], Replicate = histInfo[2]) %>%
rbind(scaleFactor, .)
}
scaleFactor$Histone = factor(scaleFactor$Histone, levels = histList)
left_join(alignDupSummary, scaleFactor, by = c("Histone", "Replicate"))
##=== R 命令 ===##
## Generate sequencing depth boxplot
fig6A = scaleFactor %>% ggplot(aes(x = Histone, y = scaleFactor, fill = Histone)) +
geom_boxplot() +
geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 20) +
ylab("Spike-in Scalling Factor") +
xlab("")
normDepth = inner_join(scaleFactor, alignResult, by = c("Histone", "Replicate")) %>% mutate(normDepth = MappedFragNum_hg38 * scaleFactor)
fig6B = normDepth %>% ggplot(aes(x = Histone, y = normDepth, fill = Histone)) +
geom_boxplot() +
geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 20) +
ylab("Normalization Fragment Count") +
xlab("") +
coord_cartesian(ylim = c(1000000, 130000000))
ggarrange(fig6A, fig6B, ncol = 2, common.legend = TRUE, legend="bottom")
好了,前面虽然写的啰嗦了一点,但是这对于数据分析是不应该忽略的内容,稳扎稳打,我们不能往往急于求成跑流程。
下一节我们将讲述关于 Peak calling
。
参考资料
How does bowtie2 assign MAPQ scores?: http://biofinysics.blogspot.com/2014/05/how-does-bowtie2-assign-mapq-scores.html